Each encoder in the Transformer consists of two main parts:
Multi-head Self-Attention followed by Add & Normalize: This is the self-attention mechanism we discussed earlier, in which every position in the input sequence attends to all positions of the same sequence. Once the attention scores are computed and used to produce an output, that output is combined with the sub-layer's input through a residual connection (the "add" step), and layer normalization is then applied to the sum.
Position-wise Feed-Forward Networks followed by Add & Normalize: This sub-layer consists of two linear transformations with a ReLU activation in between. Just as in the self-attention sub-layer, the feed-forward output is combined with its input through a residual connection and then layer-normalized.
Assuming the feed-forward network consists of two linear layers with weights $W_1$ and $W_2$ and biases $b_1$ and $b_2$, and uses ReLU as the activation function, it can be represented as:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$
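As a rough illustration (a minimal sketch, not the reference implementation), the position-wise feed-forward network can be written in PyTorch as follows; the sizes `d_model = 512` and `d_ff = 2048` are simply the values used in the original Transformer paper and are not fixed:

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at every position."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # W1, b1
        self.linear2 = nn.Linear(d_ff, d_model)   # W2, b2

    def forward(self, x):
        # x has shape (batch, seq_len, d_model); the same two linear maps
        # are applied to each position independently.
        return self.linear2(torch.relu(self.linear1(x)))
```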
So, after processing the input matrix X (which already includes the positional encodings), the encoder produces an output matrix of the same shape as X. If the Transformer has multiple encoder layers, this output serves as the input X for the next encoder layer.
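To make the whole picture concrete, the sketch below puts both sub-layers together into one simplified encoder layer and stacks several of them. It uses PyTorch's built-in `nn.MultiheadAttention` for the self-attention step; dropout, padding masks, and other practical details are omitted, and all hyperparameter values are illustrative, not prescribed by the text:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention + Add & Norm, then position-wise FFN + Add & Norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(                  # position-wise feed-forward network
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention, then Add & Normalize.
        attn_out, _ = self.self_attn(x, x, x)      # queries, keys, values all come from x
        x = self.norm1(x + attn_out)               # residual "add" followed by layer norm
        # Sub-layer 2: position-wise feed-forward network, then Add & Normalize.
        x = self.norm2(x + self.ffn(x))
        return x                                   # same shape as the input

# Stacking layers: each layer's output is the next layer's input X.
layers = nn.ModuleList([EncoderLayer() for _ in range(6)])
x = torch.randn(2, 10, 512)                        # toy batch: 2 sequences, 10 tokens, d_model = 512
for layer in layers:
    x = layer(x)
print(x.shape)                                     # torch.Size([2, 10, 512])
```

Note how the tensor keeps the shape (batch, seq_len, d_model) through every sub-layer; that is exactly what allows the output of one encoder layer to be fed directly into the next.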